Integrating Clustering and Multi-Document Summarization by Bi-Mixture Probabilistic Latent Semantic Analysis (PLSA) with Sentence Bases

Authors

  • Chao Shen
  • Tao Li
  • Chris H. Q. Ding
Abstract

Probabilistic Latent Semantic Analysis (PLSA) has been widely used in document analysis. However, as currently formulated, PLSA strictly requires the number of latent word classes to equal the number of latent document classes. In this paper, we propose Bi-mixture PLSA, a new formulation of PLSA that allows the number of latent word classes to differ from the number of latent document classes. We further extend Bi-mixture PLSA to incorporate sentence information, and propose Bi-mixture PLSA with sentence bases (Bi-PLSAS) to simultaneously cluster and summarize documents, exploiting the mutual influence of the document clustering and summarization procedures. Experiments on real-world datasets demonstrate the effectiveness of the proposed methods.

Introduction

Document clustering and multi-document summarization are two fundamental tools for understanding document data. Probabilistic Latent Semantic Analysis is a widely used method for document clustering due to the simplicity of its formulation and the efficiency of its EM-style computational algorithm. This simplicity makes it easy to incorporate PLSA into other machine learning formulations. There are many further developments of PLSA, such as Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003) and other topic models; see the review articles (Steyvers and Griffiths 2007; Blei and Lafferty 2009). The essential formulation of PLSA is the expansion of the co-occurrence probability P(word, doc) through a latent class variable z that separates the word distributions from the document distributions given the latent class. However, as currently formulated, PLSA strictly requires the number of word latent classes to be equal to the number of document latent classes (i.e., there is a one-to-one correspondence between word clusters and document clusters).
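For reference, the first display below restates the textbook PLSA factorization through a single latent class z; the second is an illustrative sketch of the decoupling idea, with separate word classes y and document classes z linked by a mixing distribution — it is not necessarily the paper's exact parameterization.

```latex
% Standard PLSA: one latent class z shared by words and documents
P(w, d) = \sum_{z} P(z)\, P(w \mid z)\, P(d \mid z)

% Bi-mixture sketch: word classes y and document classes z are
% decoupled and linked through P(y \mid z), so the number of word
% classes need not equal the number of document classes.
P(w, d) = \sum_{z} \sum_{y} P(z)\, P(d \mid z)\, P(y \mid z)\, P(w \mid y)
```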
In practical applications, however, this strict requirement may not be satisfied: if we consider documents and words as two different types of objects, they may have their own cluster structures, which are related but not necessarily the same. Recently, an extension of PLSA, called “Factorization by Given Bases” (FGB), was proposed to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices (Wang et al. 2008b). By formulating the clustering-summarization problem as one of minimizing the Kullback-Leibler divergence between the given documents and the model-reconstructed terms, the model essentially performs co-clustering on documents and sentences. However, a limitation of the model is that the number of document clusters must equal the number of sentence clusters.

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
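One way to read the FGB-style objective described above — with illustrative notation, not copied from Wang et al. (2008b) — is to treat the sentence–term distributions as fixed bases and fit document-level mixing weights by minimizing a KL divergence:

```latex
% D: documents, W: vocabulary, S: sentences (the given bases)
% U(w, d): empirical (normalized) document-term distribution
% P(w | s): term distribution of sentence s, held fixed
\min_{P(s \mid d)} \; \sum_{d \in D} \sum_{w \in W}
  U(w, d) \,\log \frac{U(w, d)}{\sum_{s \in S} P(w \mid s)\, P(s \mid d)}
```

Under this reading, the fitted weights P(s | d) both assign documents to clusters and score sentences for the summary, which is why the two tasks can inform each other.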


Related articles

Personalized Multi-Document Summarization using N-Gram Topic Model Fusion

We consider the problem of probabilistic topic modeling for query-focused multi-document summarization. Rather than modeling topics as distributions over a vocabulary of terms, we extend the probabilistic latent semantic analysis (PLSA) approach with a bigram language model. This allows us to relax the conditional independence assumption between words made by standard topic models. We present a...


Decayed DivRank for Guided Summarization

Guided summarization is essentially an aspect-based multi-document summarization, where aspects can be taken as specified queries in summarization. We proposed a novel ranking algorithm, Decayed DivRank (DDRank) for guided summarization tasks of TAC2011. DDRank can address relevance, importance, diversity, and novelty simultaneously through a decayed vertex-reinforced random walk process in sen...


Two-tier Architecture for Domain Specific Document Summarization Using Probabilistic Latent Semantic Analysis

In this research work we have proposed a two-tier architecture for document summarization. This architecture minimizes the redundancy and boosts the information relevancy in the summary by applying Probabilistic Latent Semantic Analysis (PLSA) at two levels. It also enhances the summarizer’s speed by using Incremental Expectation Maximization algorithm for PLSA learning rather than Expectation Ma...


Spoken Lecture Summarization by Random Walk over a Graph Constructed with Automatically Extracted Key Terms

This paper proposes an improved approach for spoken lecture summarization, in which random walk is performed on a graph constructed with automatically extracted key terms and probabilistic latent semantic analysis (PLSA). Each sentence of the document is represented as a node of the graph and the edge between two nodes is weighted by the topical similarity between the two sentences. The basic i...
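The graph construction sketched in this snippet (sentences as nodes, edges weighted by topical similarity) can be illustrated as follows. This is a generic sketch under assumed details — cosine similarity over per-sentence topic distributions and PageRank-style teleportation — not the cited paper's exact method, and the function name is hypothetical.

```python
import numpy as np

def rank_sentences(topic_dists, damping=0.85, iters=100):
    """Score sentences by a random walk over a similarity graph.

    topic_dists: (n_sentences x n_topics) array of per-sentence topic
    distributions (e.g. from PLSA). Illustrative sketch only.
    """
    n = topic_dists.shape[0]
    # Edge weights: cosine similarity between topic distributions.
    norms = np.linalg.norm(topic_dists, axis=1, keepdims=True)
    unit = topic_dists / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # no self-loops
    # Row-normalize into a transition matrix; dangling rows go uniform.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums,
                      out=np.full_like(sim, 1.0 / n),
                      where=row_sums > 0)
    # Power iteration with uniform teleportation (PageRank-style).
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (trans.T @ scores)
    return scores
```

Sentences whose topic distributions resemble many other sentences accumulate score mass and would be selected first for the summary.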


Multi-document Summarization using Probabilistic Topic-based Network Models

Multi-document summarization has obtained much attention in the research domain of text summarization. In the past, probabilistic topic models and network models have been leveraged to generate summaries. However, previous studies do not investigate different combinations of various topic models and network models. This paper describes an integrated approach considering both probabilistic topic...




Publication date: 2011